GenitivDB ― a Corpus-Generated Database for German Genitive Classification

Author

Roman Schneider

Abstract

We present a novel NLP resource for the explanation of linguistic phenomena, built and evaluated using very large annotated language corpora. For the compilation, we use the German Reference Corpus (DeReKo) with more than 5 billion word forms, which is the largest linguistic resource worldwide for the study of contemporary written German. The result is a comprehensive database of German genitive formations, enriched with a broad range of intra- and extralinguistic metadata. It can be used for the notoriously controversial classification and prediction of genitive endings (short endings, long endings, zero-marker). We also evaluate the main factors influencing the use of specific endings. To get a general idea of each factor's influence and its side effects, we calculate chi-square tests and visualize the residuals with an association plot. The results are evaluated against a gold standard by implementing tree-based machine learning algorithms. For the statistical analysis, we applied the supervised Logistic Model Trees (LMT) algorithm, using the WEKA software. We intend to use this gold standard to evaluate GenitivDB, as well as to explore methodologies for a predictive genitive model.
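The chi-square step mentioned in the abstract can be illustrated with a minimal sketch. It is not the author's code: the factor (syllable count of the noun), the counts, and the function name `chi_square` are hypothetical and chosen only to show how Pearson residuals, the quantities an association plot draws, fall out of a factor-by-ending contingency table.

```python
# Hypothetical counts, NOT taken from GenitivDB. Rows are an illustrative
# factor (syllable count of the noun); columns are the three ending classes.
ENDINGS = ["short", "long", "zero"]
observed = {
    "monosyllabic": [120, 380, 10],
    "polysyllabic": [640, 90, 30],
}


def chi_square(table):
    """Return (chi2 statistic, degrees of freedom, Pearson residuals)."""
    rows = list(table)
    n_cols = len(next(iter(table.values())))
    row_totals = {r: sum(table[r]) for r in rows}
    col_totals = [sum(table[r][j] for r in rows) for j in range(n_cols)]
    grand_total = sum(row_totals.values())

    chi2 = 0.0
    residuals = {}
    for r in rows:
        residuals[r] = []
        for j in range(n_cols):
            # Expected count under independence of factor and ending.
            expected = row_totals[r] * col_totals[j] / grand_total
            # Pearson residual: signed, standardized deviation of the cell.
            resid = (table[r][j] - expected) / expected ** 0.5
            residuals[r].append(resid)
            chi2 += resid ** 2
    dof = (len(rows) - 1) * (n_cols - 1)
    return chi2, dof, residuals


chi2, dof, residuals = chi_square(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}")
for level, values in residuals.items():
    cells = ", ".join(f"{e}: {v:+.1f}" for e, v in zip(ENDINGS, values))
    print(f"{level}: {cells}")
```

A positive residual marks a cell that occurs more often than independence would predict, a negative one a cell that occurs less often; an association plot encodes exactly these signs and magnitudes as bar direction and height.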

Similar resources

Syntactic Analyses for Parallel Grammars: Auxiliaries and Genitive NPs

This paper focuses on two disparate aspects of German syntax from the perspective of parallel grammar development. As part of a cooperative project, we present an innovative approach to auxiliaries and multiple genitive NPs in German. The LFG-based implementation presented here avoids unnecessary structural complexity in the representation of auxiliaries by challenging the traditional analysis of...

An Unsupervised System for Identifying English Inclusions in German Text

We present an unsupervised system that exploits linguistic knowledge resources, namely English and German lexical databases and the World Wide Web, to identify English inclusions in German text. We describe experiments with this system and the corpus which was developed for this task. We report the classification results of our system and compare them to the performance of a trained machine lea...

Generating data as a proxy for unavailable corpus data: the contextualized sentence completion task

There is much interest in using large corpora to explore predictors of the probability of higher level linguistic structures, but suitable corpora are not available for all languages and their varieties. We explore a task that uses discourse contexts from an existing corpus as prompts for sentence completion to investigate the usefulness of the method for generating data as a proxy for unavaila...

An XML-based Tool for Tracking English Inclusions in German Text

The use of lexicons and corpora advances both linguistic research and the performance of current natural language processing (NLP) systems. We present a tool that exploits such resources, specifically English and German lexical databases and the World Wide Web, to recognise English inclusions in German newspaper articles. The output of the tool can assist lexical resource developers in monitoring c...

Markedness and Blocking in German Declensional Paradigms

The loss of regular case endings in modern German has led to highly syncretic noun paradigms that neutralise many of the distinctions retained in more conservative determiner and adjective paradigms. Genitive and dative are, for all intents and purposes, the only cases marked in noun paradigms. Strong nonfeminine nouns have a genitive singular in -s. Strong nouns whose plural ends in a schwa or...

Publication date: 2014
